Chapter 19 - More about Natural Language Processing Tools (spaCy)

Text data is unstructured. But if you want to extract information from text, then you often need to process that data into a more structured representation. The common idea behind all Natural Language Processing (NLP) tools is that they try to structure or transform text in some meaningful way. You have already learned about four basic NLP steps: sentence splitting, tokenization, POS-tagging and lemmatization. For all of these, we have used the NLTK library, which is widely used in the field of NLP. However, there are some alternatives out there that are worth a look. One of them is spaCy, which is fast and accurate and supports multiple languages.

At the end of this chapter, you will be able to:

  • work with spaCy
  • find some additional NLP tools

1. The NLP pipeline

There are many tools and libraries designed to solve NLP problems. In Chapter 15, we have already seen the NLTK library for tokenization, sentence splitting, part-of-speech tagging and lemmatization. However, there are many more NLP tasks and off-the-shelf tools to perform them. These tasks often depend on each other and are therefore put into a sequence; such a sequence of NLP tasks is called an NLP pipeline. Some of the most common NLP tasks are:

  • Tokenization: splitting texts into individual words
  • Sentence splitting: splitting texts into sentences
  • Part-of-speech (POS) tagging: identifying the parts of speech of words in context (verbs, nouns, adjectives, etc.)
  • Morphological analysis: separating words into morphemes and identifying their classes (e.g. tense/aspect of verbs)
  • Stemming: identifying the stems of words in context by removing inflectional/derivational affixes, such as 'troubl' for 'trouble/troubling/troubled'
  • Lemmatization: identifying the lemmas (dictionary forms) of words in context, such as 'go' for 'go/goes/going/went'
  • Word Sense Disambiguation (WSD): assigning the correct meaning to words in context
  • Stop words recognition: identifying commonly used words (such as 'the', 'a(n)', 'in', etc.) in text, possibly to ignore them in other tasks
  • Named Entity Recognition (NER): identifying people, locations, organizations, etc. in text
  • Constituency/dependency parsing: analyzing the grammatical structure of a sentence
  • Semantic Role Labeling (SRL): analyzing the semantic structure of a sentence (who does what to whom, where and when)
  • Sentiment Analysis: determining whether a text is mostly positive or negative
  • Word Vectors (or Word Embeddings) and Semantic Similarity: representing the meaning of words as vectors of real-valued numbers, where each dimension captures an aspect of the word's meaning and where semantically similar words have similar vectors (very popular these days)

You don't always need all these modules. But it's important to know that they are there, so that you can use them when the need arises.
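
As a quick illustration of the difference between stemming and lemmatization mentioned above, here is a minimal sketch using NLTK (which you know from Chapter 15). It assumes that NLTK is installed and that the WordNet data has been downloaded (e.g. via nltk.download('wordnet')).


In [ ]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming chops off affixes and may leave a non-word, such as 'troubl'
print(stemmer.stem("trouble"), stemmer.stem("troubling"), stemmer.stem("troubled"))

# Lemmatization maps words to their dictionary form, such as 'go' for 'went'
print(lemmatizer.lemmatize("went", pos="v"), lemmatizer.lemmatize("goes", pos="v"))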

1.1 How can you use these modules?

Let's be clear about this: you don't always need to use Python for this. There are some very strong NLP programs out there that don't rely on Python. You can typically call these programs from the command line. Some examples are:

  • Treetagger is a POS-tagger and lemmatizer in one. It provides support for many different languages. If you want to call Treetagger from Python, use treetaggerwrapper. Treetagger-python also works, but is much slower.

  • Stanford's CoreNLP is a very powerful system that is able to process English, German, Spanish, French, Chinese and Arabic. (Each to a different extent, though. The pipeline for English is most complete.) There are also Python wrappers available, such as py-corenlp.

  • The Maltparser has models for English, Swedish, French, and Spanish.

Having said that, there are many NLP-tools that have been developed for Python:

  • Natural Language ToolKit (NLTK): Incredibly versatile library with a bit of everything. The main downside is that it's not the fastest library out there, and it can lag behind the state of the art.
    • Access to several corpora.
    • Create a POS-tagger. (Some of these are actually state-of-the-art if you have enough training data.)
    • Perform corpus analyses.
    • Interface with WordNet.
  • Pattern: A module that describes itself as a 'web mining module'. Implements a tokenizer, tagger, parser, and sentiment analyzer for multiple different languages. Also provides an API for Google, Twitter, Wikipedia and Bing.
  • Textblob: Another general NLP library that builds on NLTK and Pattern (see the small example after this list).
  • Gensim: For building vector spaces and topic models.
  • Corpkit is a module for corpus building and corpus management. Includes an interface to the Stanford CoreNLP parser.
  • spaCy: Tokenizer, POS-tagger, parser and named entity recogniser for English, German, Spanish, Portuguese, French, Italian and Dutch (more languages in progress). It can also predict similarity using word embeddings.
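
To give you a feel for one of these libraries, here is a minimal TextBlob sketch. It assumes that textblob has been installed with pip and that its corpora have been downloaded (python -m textblob.download_corpora):


In [ ]:
from textblob import TextBlob

blob = TextBlob("I have an awesome cat. It's sitting on the mat that I bought yesterday.")

print(blob.sentences)   # sentence splitting
print(blob.tags)        # (word, POS tag) pairs
print(blob.sentiment)   # polarity and subjectivity scores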

2. spaCy

spaCy provides a rather complete NLP pipeline: it takes a raw document and performs tokenization, POS-tagging, stop word recognition, morphological analysis, lemmatization, sentence splitting, dependency parsing and Named Entity Recognition (NER). It also supports similarity prediction, but that is outside the scope of this notebook. The advantage of spaCy is that it is really fast and has good accuracy. In addition, it currently supports multiple languages, among which: English, German, Spanish, Portuguese, French, Italian and Dutch.

In this notebook, we will show you the basic usage. If you want to learn more, please visit spaCy's website; it has extensive documentation and provides excellent user guides.

2.1 Installing and loading spaCy

To install spaCy, check out the instructions here. On this page, it is explained exactly how to install spaCy for your operating system, package manager and desired language model(s). Simply run the suggested commands in your terminal or cmd. Alternatively, you can probably also just run the following cells in this notebook:


In [ ]:
%%bash
conda install -c conda-forge spacy

In [ ]:
%%bash
python -m spacy download en

Now, let's first load spaCy. We import the spaCy module and load the English tokenizer, tagger, parser, NER and word vectors.


In [ ]:
import spacy
nlp = spacy.load('en') # other languages: de, es, pt, fr, it, nl

nlp is now a Python object representing the English NLP pipeline that we can use to process a text.

EXTRA: Larger models

For English, there are three models ranging from 'small' to 'large':

  • en_core_web_sm
  • en_core_web_md
  • en_core_web_lg

By default, the smallest one is loaded. Larger models generally have better accuracy, but they take longer to load. If you like, you can use one of them instead; you will first need to download it.


In [ ]:
#%%bash
#python -m spacy download en_core_web_md

In [ ]:
#%%bash
#python -m spacy download en_core_web_lg

In [ ]:
# uncomment one of the lines below if you want to load the medium or large model instead of the small one
#nlp = spacy.load('en_core_web_md')
#nlp = spacy.load('en_core_web_lg')

2.2 Using spaCy

Parsing a text with spaCy after loading a language model is as easy as follows:


In [ ]:
doc = nlp("I have an awesome cat. It's sitting on the mat that I bought yesterday.")

doc is now a Python object of the class Doc. It is a container for accessing linguistic annotations and a sequence of Token objects.

Doc, Token and Span objects

At this point, there are three important types of objects to remember:

  • A Doc is a sequence of Token objects.
  • A Token object represents an individual token — i.e. a word, punctuation symbol, whitespace, etc. It has attributes representing linguistic annotations.
  • A Span object is a slice from a Doc object and a sequence of Token objects.

Since Doc is a sequence of Token objects, we can iterate over all of the tokens in the text as shown below, or select a single token from the sequence:


In [ ]:
# Iterate over the tokens
for token in doc:
    print(token)
print()

# Select one single token by index
first_token = doc[0]
print("First token:", first_token)

Please note that even though these look like strings, they are not:


In [ ]:
for token in doc:
    print(token, "\t", type(token))

These Token objects have many useful methods and attributes, which we can list by using dir(). We haven't really talked about attributes during this course. While methods are operations or activities performed by an object, attributes are 'static' features of an object. Methods are called using parentheses (as we have seen with str.upper(), for instance), while attributes are accessed without parentheses. We will see some examples below.
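
To make the difference concrete, here is a small sketch contrasting a method call with an attribute lookup (reusing the first_token we selected above):


In [ ]:
word = "cat"
print(word.upper())          # upper is a method, so we call it with parentheses
print(first_token.text)      # text is an attribute, so no parentheses
print(first_token.is_punct)  # attributes can also hold booleans, e.g. whether the token is punctuation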

You can find more detailed information about the token methods and attributes in the documentation.


In [ ]:
dir(first_token)

Let's inspect some of the attributes of the tokens. Can you figure out what they mean? Feel free to try out a few more.


In [ ]:
# Print attributes of tokens
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_)

Notice that some of the attributes end with an underscore. For example, tokens have both lemma and lemma_ attributes. The lemma attribute represents the id of the lemma (integer), while the lemma_ attribute represents the unicode string representation of the lemma. In practice, you will mostly use the lemma_ attribute.


In [ ]:
# lemma gives the integer id, lemma_ gives the string
for token in doc:
    print(token.lemma, token.lemma_)

You can also use spacy.explain to find out more about certain labels:


In [ ]:
# try out some more, such as NN, ADP, PRP, VBD, VBP, VBZ, WDT, aux, nsubj, pobj, dobj, npadvmod
spacy.explain("VBZ")

You can create a Span object from the slice doc[start : end]. For instance, doc[2:5] produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step]) are not supported, as Span objects must be contiguous (cannot have gaps). You can use negative indices and open-ended ranges, which have their normal Python semantics.


In [ ]:
# Create a Span
a_slice = doc[2:5]
print(a_slice, type(a_slice))

# Iterate over Span
for token in a_slice:
    print(token.lemma_, token.pos_)

Text, sentences and noun_chunks

If you call the dir() function on a Doc object, you will see that it has a range of methods and attributes. You can read more about them in the documentation. Below, we highlight three of them: text, sents and noun_chunks.


In [ ]:
dir(doc)

First of all, text simply gives you the whole document as a string:


In [ ]:
print(doc.text)
print(type(doc.text))

sents can be used to get all the sentences. Notice that it will create a so-called 'generator'. For now, you don't have to understand exactly what a generator is (if you like, you can read more about them online). Just remember that we can use generators to iterate over an object in a fast and efficient way.


In [ ]:
# Get all the sentences as a generator 
print(doc.sents, type(doc.sents))

# We can use the generator to loop over the sentences; each sentence is a span of tokens
for sentence in doc.sents:
    print(sentence, type(sentence))
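
By the way, if you only need the first sentence, you don't have to loop over the whole generator. Here is a minimal sketch that uses the built-in next() function (each access to doc.sents gives you a fresh generator):


In [ ]:
# Take only the first sentence from the generator
first_sentence = next(doc.sents)
print(first_sentence, type(first_sentence))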

If you find this difficult to comprehend, you can also simply convert it to a list and then loop over the list. Remember that this is less efficient, though.


In [ ]:
# You can also store the sentences in a list and then loop over the list 
sentences = list(doc.sents)
for sentence in sentences:
    print(sentence, type(sentence))

The benefit of converting it to a list is that we can use indices to select certain sentences. For example, in the following we only print some information about the tokens in the second sentence.


In [ ]:
# Print some information about the tokens in the second sentence.
sentences = list(doc.sents)
for token in sentences[1]:
    data = '\t'.join([token.orth_,
                      token.lemma_,
                      token.pos_,
                      token.tag_,
                      str(token.i),    # Turn index into string
                      str(token.idx)]) # Turn index into string
    print(data)

Similarly, noun_chunks can be used to create a generator for all noun chunks in the text.


In [ ]:
# Get all the noun chunks as a generator 
print(doc.noun_chunks, type(doc.noun_chunks))

# You can loop over a generator; each noun chunk is a span of tokens
for chunk in doc.noun_chunks:
    print(chunk, type(chunk))
    print()

Named Entities

Finally, we can also very easily access the Named Entities in a text using ents. As you can see below, it will create a tuple of the entities recognized in the text. Each entity is again a span of tokens, and you can access the type of the entity with the label_ attribute of Span.


In [ ]:
# Here's a slightly longer text, from the Wikipedia page about Harry Potter.
harry_potter = "Harry Potter is a series of fantasy novels written by British author J. K. Rowling.\
The novels chronicle the life of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley,\
all of whom are students at Hogwarts School of Witchcraft and Wizardry.\
The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal,\
overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles."

doc = nlp(harry_potter)
print(doc.ents)
print(type(doc.ents))

In [ ]:
# Each entity is a span of tokens and is labeled with the type of entity
for entity in doc.ents:
    print(entity, "\t", entity.label_, "\t", type(entity))

Pretty cool, but what does NORP mean? Again, you can use spacy.explain() to find out:


In [ ]:
# find out what the entity label 'NORP' means
spacy.explain("NORP")

3. EXTRA: Stanford CoreNLP

Another very popular NLP pipeline is Stanford CoreNLP. You can use the tool from the command line, but there are also some useful Python wrappers that make use of the Stanford CoreNLP API, such as pycorenlp. As you might want to use this in the future, we will provide you with a quick start guide. To use the code below, you will have to do the following:

  1. Download Stanford CoreNLP here.
  2. Install pycorenlp (run pip install pycorenlp in your terminal, or simply run the cell below).
  3. Open a terminal and run the following commands (replace with the correct directory names):
    cd LOCATION_OF_CORENLP/stanford-corenlp-full-2018-02-27
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
    You will always have to do this step whenever you want to use the Stanford CoreNLP API (the server needs to be running in the background).

In [ ]:
#%%bash
#pip install pycorenlp

In [ ]:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')

Next, you will want to define which annotators to use and which output format should be produced (text, json, xml, conll, conllu, serialized). Annotating the document then is very easy. Note that Stanford CoreNLP uses some large models that can take a long time to load. You can read more about it here.


In [ ]:
harry_potter = "Harry Potter is a series of fantasy novels written by British author J. K. Rowling.\
The novels chronicle the life of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley,\
all of whom are students at Hogwarts School of Witchcraft and Wizardry.\
The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal,\
overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles."

# Define annotators and output format
properties= {'annotators': 'tokenize, ssplit, pos, lemma, parse',
             'outputFormat': 'json'}

# Annotate the string with CoreNLP
doc = nlp.annotate(harry_potter, properties=properties)

In the next cells, we will simply show some examples of how to access the linguistic annotations if you use the properties as shown above. If you'd like to continue working with Stanford CoreNLP in the future, you will likely have to experiment a bit more.


In [ ]:
doc.keys()

In [ ]:
sentences = doc["sentences"]
first_sentence = sentences[0]
first_sentence.keys()

In [ ]:
first_sentence["parse"]

In [ ]:
first_sentence["basicDependencies"]

In [ ]:
first_sentence["tokens"]

In [ ]:
for sent in doc["sentences"]:
    for token in sent["tokens"]:
        word = token["word"]
        lemma = token["lemma"]
        pos = token["pos"]
        print(word, lemma, pos)


4. NLTK vs. spaCy vs. CoreNLP

There might be different reasons why you want to use NLTK, spaCy or Stanford CoreNLP. There are differences in efficiency, quality, user-friendliness, functionality, output formats, etc. At this moment, we advise you to go with spaCy because of its ease of use and high-quality performance.

Here's an example of both NLTK and spaCy in action.

  • The example text is a case in point. What goes wrong here?
  • Try experimenting with the text to see what the differences are.

In [ ]:
import nltk
import spacy

nlp = spacy.load('en')

In [ ]:
text = "I like cheese very much"

print("NLTK results:")
nltk_tagged = nltk.pos_tag(text.split())
print(nltk_tagged)

print()

print("spaCy results:")
doc = nlp(text)
spacy_tagged = []
for token in doc:
    tag_data = (token.orth_, token.tag_,)
    spacy_tagged.append(tag_data)
print(spacy_tagged)

5. Some other useful modules for cleaning and preprocessing

Data is often messy, noisy or includes irrelevant information. Therefore, chances are that you will need to do some cleaning before you can start with your analysis. This is especially true for social media texts, such as tweets, chats, and emails. Typically, these texts are informal and notoriously noisy. Normalising them so that they can be processed with NLP tools is an NLP challenge in itself, and fully discussing it goes beyond the scope of this course. However, you may find the following modules useful in your project (a small illustration follows the list):

  • tweet-preprocessor: This library makes it easy to clean, parse or tokenize tweets. It supports cleaning, tokenizing and parsing of URLs, hashtags, reserved words, mentions, emojis and smileys.
  • emot: Emot is a Python library to extract emojis and emoticons from a text (string). All the emojis and emoticons are taken from a reliable source, i.e. Wikipedia.org.
  • autocorrect: Spelling corrector (Python 3).
  • html: Python's built-in module for handling HTML; it can be used to unescape HTML entities and (via html.parser) to strip HTML tags.
  • chardet: Universal encoding detector for Python 2 and 3.
  • ftfy: Fixes broken unicode strings.
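
As a small illustration of the last three modules, here is a minimal sketch using Python's built-in html module together with chardet and ftfy (assuming the latter two have been installed with pip):


In [ ]:
import html

import chardet  # assumption: installed with 'pip install chardet'
import ftfy     # assumption: installed with 'pip install ftfy'

# Unescape HTML character entities
print(html.unescape("Tom &amp; Jerry say: &quot;hello&quot;"))

# Fix broken unicode (mojibake), e.g. UTF-8 text that was accidentally decoded as Latin-1
print(ftfy.fix_text("schÃ¶n"))

# Guess the encoding of a byte string
print(chardet.detect("Ce café est très bon".encode("latin-1")))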

If you are interested in reading more about this topic, these papers discuss preprocessing and normalization.

And here is a nice blog about character encoding.

Exercises


In [ ]:
import spacy
nlp = spacy.load('en')

Exercise 1:

  1. What is the difference between token.pos_ and token.tag_? Read the docs to find out.

  2. What do the different labels mean? Use spacy.explain to inspect some of them. You can also refer to this page for a complete overview.


In [ ]:
doc = nlp("I have an awesome cat. It's sitting on the mat that I bought yesterday.")
for token in doc:
    print(token.pos_, token.tag_)

In [ ]:
spacy.explain("PRON")

Exercise 2:

Let's practice a bit with processing files. Open the file charlie.txt for reading and use read() to read its content as a string. Then use spaCy to annotate this string and print the information below. Remember: you can use dir() to remind yourself of the attributes.

For each token in the text:

  1. Text
  2. Lemma
  3. POS tag
  4. Whether it's a stopword or not
  5. Whether it's a punctuation mark or not

For each sentence in the text:

  1. The complete text
  2. The number of tokens
  3. The complete text in lowercase letters
  4. The text, lemma and POS of the first word

For each noun chunk in the text:

  1. The complete text
  2. The number of tokens
  3. The complete text in lowercase letters
  4. The text, lemma and POS of the first word

For each named entity in the text:

  1. The complete text
  2. The number of tokens
  3. The complete text in lowercase letters
  4. The text, lemma and POS of the first word

In [ ]:
filename = "../Data/Charlie/charlie.txt"

# read the file and process with spaCy

In [ ]:
# print all information about the tokens

In [ ]:
# print all information about the sentences

In [ ]:
# print all information about the noun chunks

In [ ]:
# print all information about the entities

Exercise 3:

Remember how we can use the os and glob modules to process multiple files? For example, we can read all .txt files in the dreams folder like this:


In [ ]:
import glob
filenames = glob.glob("../Data/dreams/*.txt")
print(filenames)

Now create a function called get_vocabulary that takes one positional parameter filenames. It should read all the files in filenames and return a set called unique_words that contains all unique words occurring in the files.


In [ ]:
def get_vocabulary(filenames):
    # your code here

# test your function here
unique_words = get_vocabulary(filenames)
print(unique_words, len(unique_words))
assert len(unique_words) == 415 # if your code is correct, this should not raise an error

Exercise 4:

Create a function called get_sentences_with_keyword that takes one positional parameter filenames and one keyword parameter keyword with default value None. It should read all the files in filenames and return a list called sentences that contains all sentences (the complete texts) containing the keyword.

Hints:

  • It's best to check for the lemmas of each token
  • Lowercase both your keyword and the lemma

In [ ]:
import glob
filenames = glob.glob("../Data/dreams/*.txt")
print(filenames)

In [ ]:
def get_sentences_with_keyword(filenames, keyword=None):
    #your code here

# test your function here
sentences = get_sentences_with_keyword(filenames, keyword="toy")
print(sentences)
assert len(sentences) == 4 # if your code is correct, this should not raise an error